Finally, for more information, check out all of these resources:
Base R plots are quick to produce and well suited to fast visual checks during your analysis workflow. They are worth knowing how to use, even when you prefer to plot using another plotting package like ggplot2.
Histograms are extremely useful for a first glimpse at a data set. For a quick and dirty plot, simply pass a numeric vector to the hist function, and let it sort out the details.
data(mtcars)
hist(mtcars$mpg)
Prettier versions can also be created, specifying the number of categories (breaks). You can also set graphical parameters such as axis labels, and limits (more on these in a moment).
hist(mtcars$hp,
breaks = 50,
xlab = "Horsepower (hp)",
main = "Histogram of car horsepower, 1974 Motor Trend Car Road Tests",
ylim = c(0,10))
Base R has a wide array of graphical parameters that you can set to tune the appearance of your graphics. They can be set within plotting function calls. In the above code, xlab, main, and ylim are graphical settings passed within the call to the hist function.
Even more of them can also be set with a call to the function ‘par’. See the help files for an exhaustive (and sometimes exhausting) list of base R graphical parameters.
The ones I use most frequently adjust the number of graphing panels in a graphics device (mfrow or mfcol set these), the size of the graphing margins (mar), text sizes (cex), and colors (col). Some of these have variants that apply to a single element of the graphic (e.g., cex.axis affects the size of the axis tick labels, cex.lab the size of the axis titles, and col.main the color of the main title of the graph).
For multi-panel graphics where you need to fine-tune the layout of the panels within the graphics device, the function layout is also useful.
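As a minimal sketch of the parameters just described, the following sets a few of them with par, restores the previous settings, and then uses layout for an uneven arrangement of panels. The object names old_par and panel_map are only illustrative.

```r
# par() returns the previous values, so they can be restored afterwards
old_par <- par(mfrow = c(1, 2),       # 1 row, 2 columns of panels
               mar = c(4, 4, 2, 1),   # margins: bottom, left, top, right (in lines)
               cex.axis = 0.8,        # smaller tick labels
               col.main = "navy")     # colored main titles

hist(mtcars$mpg, main = "MPG", xlab = "Miles per gallon")
hist(mtcars$hp,  main = "HP",  xlab = "Horsepower")

par(old_par)  # restore the previous settings

# layout() takes a matrix whose entries say which plot goes in which cell.
# Here plot 1 spans the whole top row; plots 2 and 3 share the bottom row.
panel_map <- matrix(c(1, 1,
                      2, 3), nrow = 2, byrow = TRUE)
layout(panel_map)
plot(mtcars$hp, mtcars$mpg, main = "MPG vs HP", xlab = "HP", ylab = "MPG")
hist(mtcars$mpg, main = "MPG", xlab = "Miles per gallon")
hist(mtcars$hp,  main = "HP",  xlab = "Horsepower")
layout(1)  # back to a single panel
```

Saving the value returned by par is a common habit because the settings otherwise persist for every later plot on the same device.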
Boxplots can describe a single column of data, multiple columns of data, or a single column of data split into categories with a formula.
par(mfrow =c(1,3))
boxplot(mtcars$mpg,
ylab = "Miles Per Gallon",
main = "Gas Mileage, All cars")
boxplot(mtcars[,c("wt","drat")],
ylab = "Value",
cex.main = .75,
main = "Car features that Emilie is unfamiliar with")
boxplot(mpg ~ cyl, ## data to display is described with a formula. In this case, miles per gallon as a function of how many cylinders
data = mtcars,
xlab = "Number of cylinders",
ylab = "Miles per gallon",
cex.axis = 1.25,
cex.main = .8,
cex.lab = 1.75,
main = "Gas Mileage, by number of cylinders" )
Barplots are an incredibly useful tool for data display. For quick and dirty barplots, base R’s functionality is great.
The table function on categorical variables will give you summaries that can be used to feed the ‘barplot’ function.
The following simple code examples should be enough to get you started with barplots.
tab1<-table(mtcars$cyl)
barplot(tab1,
xlab = "Number of cylinders",
ylab = "Number of car models")
tab2<-table(mtcars$cyl,mtcars$gear)
tab2
##
## 3 4 5
## 4 1 8 2
## 6 2 4 1
## 8 12 0 2
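barplot(tab2,
xlab = "Number of gears", # the columns of tab2 are gear counts
ylab = "Number of car models",
col = c("forestgreen", "royalblue", "yellow"),
main = "Car models by number of gears and cylinders",
legend.text = TRUE, # legend keys are the rows of tab2 (cylinders)
args.legend = list(title = "Cylinders"))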
You may have noticed that my barplots lack error bars. It is possible to add error bars to base R barplots, but it is cumbersome and awkward. ‘ggplot2’ handles error bars in barplots almost seamlessly. This is incentive enough to learn ‘ggplot2’ on its own.
There is a good discussion of this very problem online:
https://www.r-bloggers.com/building-barplots-with-error-bars/
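For reference, one common base R workaround is to draw the error bars with arrows on top of the bars. The sketch below is only one way to do it, using mean mpg per cylinder class as assumed example data; the names mpg_mean, mpg_se, and bar_mid are illustrative.

```r
# Mean and standard error of mpg for each cylinder class
mpg_mean <- tapply(mtcars$mpg, mtcars$cyl, mean)
mpg_se   <- tapply(mtcars$mpg, mtcars$cyl, function(x) sd(x) / sqrt(length(x)))

# barplot() invisibly returns the x positions of the bar midpoints
bar_mid <- barplot(mpg_mean,
                   ylim = c(0, max(mpg_mean + mpg_se) * 1.1),
                   xlab = "Number of cylinders",
                   ylab = "Mean miles per gallon")

# Draw each error bar as a double-headed arrow with flat (90 degree) heads
arrows(x0 = bar_mid, y0 = mpg_mean - mpg_se,
       x1 = bar_mid, y1 = mpg_mean + mpg_se,
       angle = 90, code = 3, length = 0.05)
```

The need to compute the summaries, capture the bar positions, and size the y axis by hand is what makes this approach feel cumbersome.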
So far, we have put only one R graph in each plot. It is possible to add other features to existing plots with supplemental functions, and also sometimes by setting the graphical parameter ‘add’ to TRUE within a (second or third) plotting function call.
Most of these will be illustrated in the following plots.
plot(mtcars$mpg ~ mtcars$hp,
type = "p", # point = p, also "o", "l", "b"
col = "red",
lwd = 2, # line weight (or point weight)
xlab = "HP", # the formula mpg ~ hp puts hp on the x axis
ylab = "MPG",
main = "MPG vs HP, 1974 Motor Trend Car Road Tests")
abline(lm(mpg~hp, data=mtcars), lty = "dashed") # abline(a = intercept, b=slope)
# add in some annotations
text(x = 250, y=25, labels = "Adj R2 = 0.589\nP=1.788e-7" )
data(iris)
plot(iris$Sepal.Length, iris$Petal.Length, # x variable, y variable
col = iris$Species, # colour by species
pch = 16, # type of point to use
cex = 2, # size of point to use
xlab = "Sepal Length", # x axis label
ylab = "Petal Length", # y axis label
main = "Flower Characteristics in Iris") # plot title
legend (x = 4.5, y = 7, legend = levels(iris$Species), col = c(1:3), pch = 16)
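The examples above rely on supplemental functions (abline, text, legend). As a small sketch of the other route, setting add = TRUE in a second plotting call, here is a histogram with a normal curve drawn over it. The choice of a normal curve is only for illustration and is not a claim about the distribution of the data.

```r
# Density-scaled histogram, so that a density curve can share the y axis
hist(mtcars$mpg, freq = FALSE,
     xlab = "Miles per gallon",
     main = "MPG with a fitted normal curve")

# curve() is a plotting function in its own right;
# add = TRUE draws onto the existing plot instead of starting a new one
mpg_mean <- mean(mtcars$mpg)
mpg_sd   <- sd(mtcars$mpg)
curve(dnorm(x, mean = mpg_mean, sd = mpg_sd), add = TRUE, lty = "dashed")

# lines() and points() are supplemental functions: they only add to an existing plot
lines(density(mtcars$mpg), col = "red")
points(x = mpg_mean, y = 0, pch = 17, col = "blue")  # mark the sample mean on the x axis
```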
For many packages, R programmers have written methods for the ‘plot’ function that interact with the objects that their packages create. These are called through the generic function ‘plot’, or, where the method is exported, by the method’s specific name (e.g., ‘plot.lm’ is the plot method that interacts with a linear regression model object).
‘plot.lm’ is a useful case to illustrate this functionality.
When running regression models, many researchers use diagnostic plots to confirm their model selection choices. While these can be created in ggplot2 using the ggfortify package, base R plotting handles this deftly without additional packages.
While the resulting display may not be something you wish to publish, it is an extremely useful tool for the model building process.
fit1 <- lm(Sepal.Length~ Petal.Length, data=iris)
# change layout to 2x2 panel
par(mfrow = c(2,2)) # change to 2x2
plot(fit1) # plot fitting diagnostics
par(mfrow = c(1, 1)) # change back to 1x1
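If you only want some of the diagnostic panels, the lm method accepts a which argument. The following is a small sketch that refits the same model and shows only the first two panels.

```r
fit1 <- lm(Sepal.Length ~ Petal.Length, data = iris)

# plot() dispatches on the class of its first argument
class(fit1)

# `which` selects individual diagnostic panels:
# 1 = residuals vs fitted, 2 = normal Q-Q
par(mfrow = c(1, 2))
plot(fit1, which = 1:2)
par(mfrow = c(1, 1))  # change back to 1x1
```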
There are three basic ways to export graphics files when you are working in RStudio.
1. You can use the ‘Export’ button just above the plotting panel in the RStudio interface. This simply saves the current content of the plotting window to a file. This is cumbersome for large numbers of graphics, and doesn’t yield high quality images.
2. If you are working in a Windows operating system, and not using RStudio, the ‘savePlot’ function is a straightforward way to save graphic images from windows devices in a scripting context. Since we are working within RStudio, we won’t elaborate on this today.
3. Base R’s functions for opening specific types of graphics devices (bmp, jpeg, png, and tiff) allow for fine-tuned formatting and configuration of output graphics. Graphic construction with this method is a three-step process: open the graphics device, draw the plot, and close the device with dev.off.
The last method is powerful enough to create publication-quality graphics. However, it isn’t the most straightforward (Aaron thinks ggplot2 does it better, and Emilie agrees, but learned this method eons ago).
Here is an example of method 3 to create a ‘.png’ graphics file.
png("Sepal vs Petal Length in Iris.png", width = 500, height = 500, res = 72)
plot(iris$Sepal.Length, iris$Petal.Length,
col = iris$Species,
main = "Sepal vs Petal Length in Iris")
dev.off()
## png
## 2
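For print-quality output, the same three steps can specify the physical size and resolution instead of pixels. This is a sketch only: the file path is a hypothetical temporary location, and 5 inches at 300 dpi is an assumed target, not a requirement of any particular journal.

```r
out_file <- file.path(tempdir(), "iris_lengths.png")  # hypothetical output path

# Step 1: open the device. units and res together fix the pixel dimensions
# (5 in x 300 dpi = 1500 px per side).
png(out_file, width = 5, height = 5, units = "in", res = 300)

# Step 2: draw the plot; nothing appears on screen because output goes to the file
plot(iris$Sepal.Length, iris$Petal.Length,
     col = iris$Species, pch = 16,
     xlab = "Sepal Length", ylab = "Petal Length",
     main = "Sepal vs Petal Length in Iris")

# Step 3: close the device, which finishes writing the file
dev.off()

file.exists(out_file)
```

Forgetting the final dev.off() is the usual reason a file turns out empty or later plots seem to disappear.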
Base R has a powerful plotting engine, and it is widely used in many existing packages. However, using it can be convoluted, and it takes an adept user to produce publication quality graphs.
ggplot2 is a far more powerful tool to use for data exploration, and for the production of high-quality graphics to illustrate complex data sets.
ggplot2
In the tidyverse collection there is a package called ggplot2. It provides a “grammar of graphics” approach to plotting, where you create plots layer by layer and arrange the plots as you see fit.
For more information: http://r4ds.had.co.nz/data-visualisation.html
And if you really want the theoretical underpinnings of ggplot2, read this: http://vita.had.co.nz/papers/layered-grammar.pdf
The basic grammatical structure of plotting with ggplot2 involves three basic pieces:
A basic call to the ‘ggplot’ function, which specifies things like data to use, and aesthetic mapping (data columns to use, colors, and symbols and other things specified by ‘aes’).
A + character to link together all of the pieces of the plot. ggplot2 predates the rest of the tidyverse by a few years, and came before the introduction of the pipe operator (%>%). This is a major point of confusion among people just learning the tidyverse.
Functions that specify the type of geometry to use for plotting portions of the data defined in the basic call to ggplot. It is possible to use many of these within a single plot (linked with a ‘+’). Aesthetic mappings may also be specified here.
The main thing to understand with ggplot2 is graphs are built layer-by-layer using geom_*() functions that specify what geometry to use. Within each geometry, you can specify values for x, y, what color to fill or outline x and y with, and what groupings to use. Examples of this are in the following code chunks.
Here is how those three pieces appear in R:
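# Template only: MYDATA, MY_X, MY_Y and MY_COLOR are placeholders. Aesthetic names are lowercase.
ggplot(data = MYDATA, aes(x = MY_X, y = MY_Y, color = MY_COLOR)) +
geom_*() # Replace with a real geom, e.g. geom_point(). Many types of geom in ggplot2.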
library(ggplot2)
library(nycflights13) # our dataset
data(flights)
ggplot(data = flights, aes(x=month)) + # call ggplot() to start the plot.
# Specify data, and that x=month.
geom_bar()
The aes() function is the aesthetic mapping of the plot. In this example, aes is called in the first ggplot() function and is then applied globally to all geoms below it. If you instead put aes(x=month) in geom_bar(), as in geom_bar(aes(x=month)), the aesthetic mapping applies only to that layer of the graph. You could then choose to add another layer, like geom_text(), and specify a new value of x to use with that layer.
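To make the global versus per-layer distinction concrete, here is a small sketch on the built-in mtcars data (chosen only because it is small; the object names p_global and p_layer are illustrative). The same color mapping gives different results depending on where it is declared.

```r
library(ggplot2)

# Global mapping: every layer inherits x, y and color,
# so geom_smooth() fits one line per cylinder group
p_global <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Layer-level mapping: color belongs to the points only,
# so geom_smooth() sees no grouping and fits a single line
p_layer <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(aes(color = factor(cyl))) +
  geom_smooth(method = "lm", se = FALSE)

p_global
p_layer
```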
# better, cleaned up version of the graph
ggplot(flights, aes(factor(month))) +
geom_bar() +
ggtitle('NY Flights by Month, 2013') + # ggtitle() is a shortcut function to get a title
xlab('Month') + # sets the x-axis label
ylab('Count') + # y-axis label
theme_minimal() # applies a theme to ggplot that changes a lot of things.
In this version, I’ve cleaned up the plot a bit by giving proper labels and titles, and applying a theme (theme_minimal()). There are many themes in ggplot2 and even more in the ggthemes package.
You can also create your own themes and save them. This is incredibly useful if you use branding on your images and want to include common colors, a company/agency logo, or other things on all of your plots. For more information:
http://ggplot2.tidyverse.org/reference/theme.html
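A reusable theme can be as simple as a function that starts from a built-in theme and overrides a few elements. The sketch below uses a hypothetical name, theme_house, and arbitrary style choices; adding a logo would need extra packages and is not shown.

```r
library(ggplot2)

# Hypothetical house theme: starts from theme_minimal() and overrides a few elements
theme_house <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", colour = "navy"),
      panel.grid.minor = element_blank(),
      legend.position = "bottom"
    )
}

ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point() +
  ggtitle("MPG vs horsepower") +
  theme_house()

# Optionally make it the default for every later plot in the session:
# theme_set(theme_house())
```

Wrapping the theme in a function keeps the styling in one place, so a change to the house style propagates to every plot that uses it.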
And for some fun aimed at fans of the web comic XKCD:
fill and color
The fill and color arguments are part of the aes() function. They map grouping variables to aesthetics. color can be thought of as the outline of an object, whereas fill is the actual color filling it. You usually don’t want to use color on something like a geom_bar() plot, because it only outlines the bars; it is used more with geom_point(), for example. See the following 2 plots for a visual explanation of what I mean.
p1 <- ggplot(flights, aes(factor(month)))
p1 + geom_bar(aes(color=carrier))
In this case each carrier’s segment of the bar is still gray-filled but has a unique color outline.
p1 + geom_bar(aes(fill=carrier))
This is more of what we want. The entire bar is filled. But sometimes stacked bars look a little too busy to convey the right information. This is where the customization options in ggplot2 really help.
position_dodge()
Maybe we’re more interested in looking only at the summer months, when travel picks up, and we care about the number of flights each airline makes during those months. We can subset and display our bar chart differently to highlight what we want to show the audience.
ggplot(subset(flights, month %in% c(6, 7, 8)), aes(x = factor(month))) + # %in% keeps every row from June, July, or August
geom_bar(aes(fill = carrier), position = position_dodge())
Notice how the subset() function was called within the data argument of the call to ggplot(). Your original data source remains unchanged, but you only display the subset of the data relevant to your work.
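One caution when subsetting on several values: == compares element by element and recycles the shorter vector, whereas %in% tests membership. The toy vector below stands in for flights$month (an assumption made purely to keep the example small) and shows how == silently drops rows.

```r
months <- c(6, 7, 8, 6, 7, 8, 8, 7, 6)   # toy stand-in for flights$month
target <- c(6, 7, 8)

# `==` recycles target to 6,7,8,6,7,8,6,7,8 and compares position by position
months == target
sum(months == target)    # 7 of the 9 rows are kept

# `%in%` asks, for each element of months, whether it appears anywhere in target
months %in% target
sum(months %in% target)  # all 9 rows are kept
```

With real data the rows lost to == depend on row order, so the resulting plot can look plausible while showing only part of the data.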
In this example, we’re plotting arrival and departure delays for flights from the big 3 airlines: Delta (DL), United Airlines (UA), and American Airlines (AA). The geom_point() geom is used for scatter plots.
ggplot(dplyr::filter(flights, carrier %in% c('AA', 'DL', 'UA'))) + # %in% rather than ==, which would recycle the vector and drop rows
geom_point(aes(x = dep_delay, y = arr_delay, color = carrier))
How to add error bars in ggplot2 isn’t always immediately clear, but knowing how to do it will pay dividends down the road.
More here: http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)
library(dplyr) # provides %>%, group_by() and summarise()
se <- function(x) { x <- x[!is.na(x)]; sd(x) / sqrt(length(x)) } ## a little function calculating a standard error, ignoring NAs
flightsum <- flights %>% group_by(carrier) %>% summarise(delay = mean(dep_delay, na.rm = TRUE), se = se(dep_delay))
ggplot(flightsum, aes(x=carrier, y=delay)) +
geom_col(position=position_dodge()) +
geom_errorbar(aes(ymin=delay-se, ymax=delay+se),
width=.2, # Width of the error bars
position=position_dodge(.9))
dplyr to arrange data before plotting
The carrier field in our flights data uses carrier codes to identify each carrier. What if we wanted the actual names for the carriers so we could make a publication-quality plot? ggplot2 fits nicely in the workflow of the tidyverse packages and allows you to manipulate and join your source data before plotting, without munging your source data in the process.
data(airlines)
airlines
## # A tibble: 16 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
data(flights)
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 542 540 2. 923
## 4 2013 1 1 544 545 -1. 1004
## 5 2013 1 1 554 600 -6. 812
## 6 2013 1 1 554 558 -4. 740
## 7 2013 1 1 555 600 -5. 913
## 8 2013 1 1 557 600 -3. 709
## 9 2013 1 1 557 600 -3. 838
## 10 2013 1 1 558 600 -2. 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
flights %>%
select(carrier, dep_delay, arr_delay) %>%
left_join(airlines, by = "carrier") %>%
filter(carrier %in% c("AA", "UA", "DL")) %>% # %in% keeps all rows for these three carriers
ggplot() + geom_point(aes(x=dep_delay, y=arr_delay, color=name), alpha=.5) +
ggtitle('Arrival and departure delays, by Airline') +
xlab('Departure delay, minutes') +
ylab('Arrival delay, minutes') +
theme_minimal()
flights %>%
select(carrier, dep_delay, arr_delay, origin) %>%
left_join(airlines, by = "carrier") %>%
filter(carrier %in% c("AA", "UA", "DL")) %>% # %in% keeps all rows for these three carriers
ggplot() + geom_point(aes(x=dep_delay, y=arr_delay, color=name), alpha=.5) +
ggtitle('Arrival and departure delays, by Airline') +
xlab('Departure delay, minutes') +
ylab('Arrival delay, minutes') +
theme_minimal() +
facet_wrap(~origin)
ggplot2 graphs
Let’s use mtcars, which is a collection of stats on various makes and models of car.
data(mtcars)
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Simple plot of hp and mpg, with the points color-coded by cyl (cylinders). A loess smoother has been overlaid.
ggplot(mtcars) + geom_point(aes(hp, mpg, color = factor(cyl))) + # cyl is numeric in this dataset,
# but really is categorical rather than continuous
# so we'll coerce it to be a factor with factor(cyl)
xlab('Horsepower') +
ylab('MPG') +
geom_smooth(aes(hp, mpg), method = "loess") +
theme_minimal()
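This version adds a grouping variable (cyl): because color is mapped globally, geom_smooth() fits a separate line for each cylinder group. Note that it uses a linear model (method = lm) rather than loess.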
ggplot(mtcars, aes(x=hp, y=mpg, color=factor(cyl))) +
geom_point() +
geom_smooth(method=lm) +
theme_minimal()
There are a lot of choices for interactive graphics, each with advantages and disadvantages. For brevity, we’re going to look only at Plotly.js, which is built on D3.js. I chose this one primarily because of its connection with ggplot2, and because the R library is free and open-source.
ggplotly
library(plotly)
p1 <- ggplot(mtcars, aes(x=hp, y=mpg, color=factor(cyl))) +
geom_point() +
geom_smooth(method=lm) + theme_minimal()
ggplotly(p1)
library(plotly)
library(rjson)
json_file <- "https://raw.githubusercontent.com/plotly/plotly.js/master/test/image/mocks/sankey_energy.json"
json_data <- fromJSON(paste(readLines(json_file), collapse=""))
p <- plot_ly(
type = "sankey",
domain = list( # a list keeps x and y as separate ranges; c() would flatten them into one vector
x = c(0,1),
y = c(0,1)
),
orientation = "h",
valueformat = ".0f",
valuesuffix = "TWh",
node = list(
label = json_data$data[[1]]$node$label,
color = json_data$data[[1]]$node$color,
pad = 15,
thickness = 15,
line = list(
color = "black",
width = 0.5
)
),
link = list(
source = json_data$data[[1]]$link$source,
target = json_data$data[[1]]$link$target,
value = json_data$data[[1]]$link$value,
label = json_data$data[[1]]$link$label
)
) %>%
layout(
title = "Energy forecast for 2050<br>Source: Department of Energy & Climate Change, Tom Counsell via <a href='https://bost.ocks.org/mike/sankey/'>Mike Bostock</a>",
font = list(
size = 10
),
height = 850,
width = 800,
xaxis = list(showgrid = F, zeroline = F),
yaxis = list(showgrid = F, zeroline = F)
)
p